 presentation slide


Do Slides Help? Multi-modal Context for Automatic Transcription of Conference Talks

arXiv.org Artificial Intelligence

State-of-the-art (SOTA) Automatic Speech Recognition (ASR) systems primarily rely on acoustic information while disregarding additional multi-modal context. However, visual information is essential for disambiguation and adaptation. While most work focuses on speaker images to handle noisy conditions, this work also integrates presentation slides for the use case of scientific presentations. In a first step, we create a benchmark for multi-modal presentations, including an automatic analysis of how well domain-specific terminology is transcribed. Next, we explore methods for augmenting speech models with multi-modal information. We mitigate the lack of datasets with accompanying slides through a suitable data augmentation approach. Finally, we train a model using the augmented dataset, resulting in a relative reduction in word error rate of approximately 34% across all words and 35% for domain-specific terms compared to the baseline model.
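To make the reported metric concrete, here is a minimal sketch of how word error rate (WER) and a relative WER reduction are commonly computed. The function names and the example numbers in the usage note are illustrative assumptions; this is not the evaluation code of the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error that the new system removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

As a purely hypothetical illustration, a baseline WER of 0.20 and a new WER of 0.132 would give `relative_reduction(0.20, 0.132)` of about 0.34, that is, a 34% relative reduction.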


Index-MSR: A high-efficiency multimodal fusion framework for speech recognition

arXiv.org Artificial Intelligence

Driven by large-scale datasets and LLM-based architectures, automatic speech recognition (ASR) systems have achieved remarkable improvements in accuracy. However, challenges persist for domain-specific terminology and for short utterances lacking semantic coherence, where recognition performance often degrades significantly. At the core of Index-MSR is a novel Multimodal Fusion Decoder (MFD), which effectively incorporates text-related information from videos (e.g., subtitles and presentation slides) into speech recognition. This cross-modal integration not only enhances overall ASR accuracy but also yields substantial reductions in substitution errors. Extensive evaluations on both an in-house subtitle dataset and a public AVSR dataset demonstrate that Index-MSR achieves state-of-the-art accuracy, with substitution errors reduced by 20-50%. These results demonstrate that our approach efficiently exploits text-related cues from video to improve speech recognition accuracy, showing strong potential in applications requiring strict audio-text synchronization, such as audio translation.


Seeing Like a Designer Without One: A Study on Unsupervised Slide Quality Assessment via Designer Cue Augmentation

arXiv.org Artificial Intelligence

We present an unsupervised slide-quality assessment pipeline that combines seven expert-inspired visual-design metrics (whitespace, colorfulness, edge density, brightness contrast, text density, color harmony, layout balance) with CLIP-ViT embeddings, using Isolation Forest-based anomaly scoring to evaluate presentation slides. Trained on 12k professional lecture slides and evaluated on six academic talks (115 slides), our method achieved Pearson correlations up to 0.83 with human visual-quality ratings, 1.79× to 3.23× stronger than scores from leading vision-language models (ChatGPT o4-mini-high, ChatGPT o3, Claude Sonnet 4, Gemini 2.5 Pro). We demonstrate convergent validity with visual ratings, discriminant validity against speaker-delivery scores, and exploratory alignment with overall impressions. Our results show that augmenting low-level design cues with multimodal embeddings closely approximates audience perceptions of slide quality, enabling scalable, objective feedback in real time. Slideware such as PowerPoint, Keynote and Google Slides has become the primary visual channel in classrooms, boardrooms and pitch competitions.
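The sketch below illustrates two of the named low-level cues (whitespace and brightness contrast) and the Pearson correlation used to compare scores with human ratings. The metric definitions, the 0.9 brightness threshold, and all function names are assumptions made for illustration; the abstract does not specify them, and the CLIP-ViT embeddings and Isolation Forest scoring are not reproduced here.

```python
import math


def whitespace_ratio(gray, threshold=0.9):
    """Share of pixels at least as bright as `threshold` (assumed definition).

    `gray` is a list of rows of brightness values in [0, 1].
    """
    pixels = [p for row in gray for p in row]
    return sum(p >= threshold for p in pixels) / len(pixels)


def brightness_contrast(gray):
    """Population standard deviation of brightness (assumed definition)."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    return math.sqrt(sum((p - mean) ** 2 for p in pixels) / len(pixels))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)
```

In a setup like the one described, per-slide model scores would be passed as `xs` and the corresponding human visual-quality ratings as `ys`.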


DesignLab: Designing Slides Through Iterative Detection and Correction

arXiv.org Artificial Intelligence

Designing high-quality presentation slides can be challenging for non-experts due to the complexity involved in navigating various design choices. Numerous automated tools can suggest layouts and color schemes, yet they often lack the ability to refine their own output, which is a key aspect of real-world workflows. We propose DesignLab, which separates the design process into two roles: the design reviewer, who identifies design-related issues, and the design contributor, who corrects them. This decomposition enables an iterative loop in which the reviewer continuously detects issues and the contributor corrects them, allowing a draft to be further polished with each iteration and to reach qualities that were previously unattainable. We fine-tune large language models for these roles and simulate intermediate drafts by introducing controlled perturbations, enabling the design reviewer to learn design errors and the contributor to learn how to fix them. Our experiments show that DesignLab outperforms existing design-generation methods, including a commercial tool, by embracing the iterative nature of designing, which can result in polished, professional slides.
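The reviewer and contributor loop can be sketched as follows. This is a toy illustration only: DesignLab uses fine-tuned large language models for both roles, whereas the rule-based `review` and `contribute` functions, the slide representation, and the thresholds (18 pt, 6 bullets) are hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Tuple

Slide = Dict[str, int]


def review(slide: Slide) -> List[str]:
    """Toy reviewer: returns the issues it detects in the current draft."""
    issues = []
    if slide["font_size"] < 18:
        issues.append("font_too_small")
    if slide["bullets"] > 6:
        issues.append("too_many_bullets")
    return issues


def contribute(slide: Slide, issue: str) -> Slide:
    """Toy contributor: returns a new draft with one issue corrected."""
    fixed = dict(slide)
    if issue == "font_too_small":
        fixed["font_size"] = 18
    elif issue == "too_many_bullets":
        fixed["bullets"] = 6
    return fixed


def refine(
    slide: Slide,
    reviewer: Callable[[Slide], List[str]] = review,
    contributor: Callable[[Slide, str], Slide] = contribute,
    max_rounds: int = 5,
) -> Tuple[Slide, List[Slide]]:
    """Alternate detection and correction until no issue remains."""
    history = [slide]
    for _ in range(max_rounds):
        issues = reviewer(slide)
        if not issues:
            break
        # Correct one issue per round so every intermediate draft is kept.
        slide = contributor(slide, issues[0])
        history.append(slide)
    return slide, history
```

Calling `refine({"font_size": 10, "bullets": 9})` returns the corrected draft together with the list of intermediate drafts, which mirrors the idea of polishing a slide a little more in each iteration.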


Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides

arXiv.org Artificial Intelligence

Incorporating visual modalities to assist Automatic Speech Recognition (ASR) tasks has led to significant improvements. However, existing Audio-Visual Speech Recognition (AVSR) datasets and methods typically rely solely on lip-reading information or speaking contextual video, neglecting the potential of combining these different valuable visual cues within the speaking context. In this paper, we release a multimodal Chinese AVSR dataset, Chinese-LiPS, comprising 100 hours of speech, video, and corresponding manual transcription, with the visual modality encompassing both lip-reading information and the presentation slides used by the speaker. Based on Chinese-LiPS, we develop a simple yet effective pipeline, LiPS-AVSR, which leverages both lip-reading and presentation slide information as visual modalities for AVSR tasks. Experiments show that lip-reading and presentation slide information improve ASR performance by approximately 8% and 25%, respectively, with a combined performance improvement of about 35%. The dataset is available at https://kiri0824.github.io/Chinese-LiPS/


PASS: Presentation Automation for Slide Generation and Speech

arXiv.org Artificial Intelligence

In today's fast-paced world, effective presentations have become an essential tool for communication in both online and offline meetings. Crafting a compelling presentation requires significant time and effort, from gathering key insights to designing slides that convey information clearly and concisely. However, despite the wealth of resources available, people often find themselves manually extracting crucial points, analyzing data, and organizing content in a way that ensures clarity and impact. Furthermore, a successful presentation goes beyond just the slides; it demands rehearsal and the ability to weave a captivating narrative to fully engage the audience. Although there has been some exploration of automating document-to-slide generation, existing research is largely centered on converting research papers. In addition, automation of the delivery of these presentations has yet to be addressed. We introduce PASS, a pipeline that generates slides from general Word documents, going beyond just research papers, and also automates the oral delivery of the generated slides. PASS analyzes user documents to create a dynamic, engaging presentation with an AI-generated voice. Additionally, we developed an LLM-based evaluation metric to assess our pipeline across three critical dimensions of presentations: relevance, coherence, and redundancy. The data and code are available at https://github.com/AggarwalTushar/PASS.


Increasing the Accessibility of Causal Domain Knowledge via Causal Information Extraction Methods: A Case Study in the Semiconductor Manufacturing Industry

arXiv.org Artificial Intelligence

The extraction of causal information from textual data is crucial in industry for identifying and mitigating potential failures, enhancing process efficiency, prompting quality improvements, and addressing various operational challenges. This paper presents a study on the development of automated methods for causal information extraction from actual industrial documents in the semiconductor manufacturing industry. The study proposes two types of causal information extraction methods, single-stage sequence tagging (SST) and multi-stage sequence tagging (MST), and evaluates their performance using existing documents from a semiconductor manufacturing company, including presentation slides and FMEA (Failure Mode and Effects Analysis) documents. The study also investigates the effect of representation learning on downstream tasks. The presented case study shows that the proposed MST methods for extracting causal information from industrial documents are suitable for practical applications, especially for semi-structured documents such as FMEAs, with a 93% F1 score. Additionally, MST achieves a 73% F1 score on texts extracted from presentation slides. Finally, the study highlights the importance of choosing a language model that is more aligned with the domain and of in-domain fine-tuning.
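For readers unfamiliar with how an F1 score is obtained for sequence tagging, the following minimal sketch computes a span-level F1 from BIO tags. The tag names (`B-CAUSE`, `I-EFFECT`, and so on) and the exact-match span criterion are assumptions for illustration; the abstract does not state which tag scheme or matching rule the study uses.

```python
def extract_spans(tags):
    """Return the set of (label, start, end) spans in a BIO tag sequence.

    `end` is exclusive. An I- tag that does not continue an open span of
    the same label is ignored.
    """
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            spans.add((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def span_f1(gold_tags, pred_tags):
    """Exact-match span F1: harmonic mean of span precision and recall."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this criterion a predicted span counts as correct only if its label and both boundaries match a gold span, so a cause phrase that is cut one token short contributes to neither precision nor recall.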


Google is quietly building an omnipresent AI that will be linked to all your devices and apps - and 'knows everything about your life'

Daily Mail - Science & tech

Confidential documents presented at a recent internal Google summit detail the tech giant's plan to create an artificial intelligence (AI) designed to become its users' 'Life Story Teller.' But to do it, the AI will require unprecedented access to each user's personal data. It's unclear where this experimental AI, currently dubbed 'Project Ellmann,' will reside among Google's apps and services, but the team behind it works for Google Photos, and their presentation suggested a tailored AI chatbot. 'We can't answer tough questions or tell good stories without a bird's-eye view of your life,' read one portion of the presentation, made by a Google product manager. Building off the company's ChatGPT rival Gemini, Project Ellmann will use 'large language models' (LLMs) to synthesize personal information from context said to include biographies of users and their loved ones, as well as stored photo 'moments.' But the new developments may spark alarm from those outraged by Google's secret collection of millions of individuals' sensitive medical records, code-named Project Nightingale in 2019, or anyone who eagerly collects digital privacy tips.


The 7 Reasons Most Machine Learning Funds Fail (Presentation Slides)

#artificialintelligence

The rate of failure in quantitative finance is high, and particularly so in financial machine learning. The few managers who succeed amass a large amount of assets, and deliver consistently exceptional performance to their investors. However, that is a rare outcome, for reasons that will become apparent in this presentation. Over the past two decades, I have seen many faces come and go, firms started and shut down. In my experience, there are 7 critical mistakes underlying most of those failures. This paper is partly based on the book Advances in Financial Machine Learning (Wiley, 2018).


Extractive Research Slide Generation Using Windowed Labeling Ranking

arXiv.org Artificial Intelligence

Presentation slides describing the content of scientific and technical papers are an efficient and effective way to present that work. However, manually generating presentation slides is labor-intensive. We propose a method to automatically generate slides for scientific papers based on a corpus of 5000 paper-slide pairs compiled from conference proceedings websites. The sentence labeling module of our method is based on SummaRuNNer, a neural sequence model for extractive summarization. Instead of ranking sentences based on semantic similarities in the whole document, our algorithm measures the importance and novelty of sentences by combining semantic and lexical features within a sentence window. Our method outperforms several baseline methods, including SummaRuNNer, by a significant margin in terms of ROUGE score.
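The idea of scoring sentences inside a local window rather than against the whole document can be sketched as below. This is a simplified, purely lexical illustration: the Jaccard-based importance and novelty scores, the `novelty_weight` parameter, and the function names are assumptions, and the neural SummaRuNNer-based labeling and the semantic features of the actual method are not reproduced.

```python
def tokens(sentence):
    """Lower-cased word set with surrounding punctuation stripped."""
    return {w.strip(".,;:!?").lower() for w in sentence.split()} - {""}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def windowed_scores(sentences, window=2, novelty_weight=0.5):
    """Score each sentence using only its neighbours within `window`."""
    bags = [tokens(s) for s in sentences]
    scores = []
    for i, bag in enumerate(bags):
        lo, hi = max(0, i - window), min(len(bags), i + window + 1)
        neighbours = [bags[j] for j in range(lo, hi) if j != i]
        context = set().union(*neighbours)
        # Importance: lexical overlap with the surrounding window.
        importance = jaccard(bag, context)
        # Novelty: dissimilarity to the earlier sentences in the window.
        earlier = bags[lo:i]
        novelty = 1.0 - max((jaccard(bag, p) for p in earlier), default=0.0)
        scores.append(importance + novelty_weight * novelty)
    return scores


def top_k(sentences, k, **kwargs):
    """Pick the k best-scoring sentences, returned in document order."""
    scores = windowed_scores(sentences, **kwargs)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]
```

Restricting the comparison to a window keeps a sentence from being penalized for resembling text in a distant section, while the novelty term discourages selecting near-duplicates that appear close together.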